Reinforcement learning is widely used for dialogue policy optimization, where the reward function often consists of more than one component, e.g., the dialogue success and the dialogue length. In this work, we propose a structured method for finding a good balance between these components by searching for the optimal reward component weighting. To make this search feasible, we use multi-objective reinforcement learning to significantly reduce the number of training dialogues required. We apply our proposed method to find optimized component weights for six domains and compare them to a default baseline.
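One way to picture the weighting being searched is as a linear scalarization of a reward vector. The Python sketch below assumes two components, a binary success indicator and a per-turn length penalty, combined by a weighted sum, and enumerates candidate weightings on a simple grid. The names `DialogueOutcome`, `component_rewards`, `scalarize`, and `weight_grid`, the linear form, and the grid are illustrative assumptions, not the paper's actual reward definition or search procedure.

```python
from dataclasses import dataclass


@dataclass
class DialogueOutcome:
    """Hypothetical summary of one finished dialogue."""
    success: bool
    num_turns: int


def component_rewards(outcome: DialogueOutcome) -> tuple[float, float]:
    # Assumed reward vector: (success indicator, negative dialogue length).
    return (1.0 if outcome.success else 0.0, -float(outcome.num_turns))


def scalarize(components: tuple[float, ...], weights: tuple[float, ...]) -> float:
    # Linear scalarization: the weighted sum of the reward components.
    return sum(w * r for w, r in zip(weights, components))


def weight_grid(steps: int) -> list[tuple[float, float]]:
    # Candidate weightings (w, 1 - w) for two components, evenly spaced in [0, 1].
    return [(i / steps, 1 - i / steps) for i in range(steps + 1)]


if __name__ == "__main__":
    outcomes = [DialogueOutcome(True, 5), DialogueOutcome(False, 10)]
    for weights in weight_grid(2):
        returns = [scalarize(component_rewards(o), weights) for o in outcomes]
        print(weights, returns)
```

With single-objective RL, each candidate weighting in such a grid would need its own policy trained from scratch; the abstract's point is that a multi-objective formulation can cover many weightings with far fewer training dialogues.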